### Setup
%matplotlib inline
# %load_ext pretty_jupyter
# should enable plotting without explicit call .show()
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
from matplotlib.colors import LogNorm
# classes for special types
from pandas.api.types import CategoricalDtype
# Apply the default theme
sns.set_theme()
Introduction¶
In this report, we're going to predict how many goals will NHL players score next season. The prediction will be based on various player's data, such as average ice-time, age, shooting percentage, etc.
Dataset overview¶
The provided dataset consists of 2 csv files:
nhl-teams.csv:¶
This file contains names of NHL teams together with their short name. There are 35 rows (1 row = 1 team) and only 2 columns. The structure is as follows:
# Reading and inspecting data
df = pd.read_csv("data/nhl-teams.csv")
df.head(5)
| team | team_full | |
|---|---|---|
| 0 | ANA | Anaheim Ducks |
| 1 | ARI | Arizona Coyottes |
| 2 | ATL | Atlanta Thrashers |
| 3 | BOS | Boston Bruins |
| 4 | BUF | Buffalo Sabres |
nhl-player-data.csv¶
This file contains various data about NHL players (goals, points per season, average time on ice, age, etc.) for seasons 2004-2018. Each row contains data for the given player per season, i.e. if the player has played multiple seasons in NHL, there will be multiple rows containing his data (1 row per each season).
Size of the dataset is about 1.34 MB
There are 12328 rows and 32 columns
Structure is as follows:
df = pd.read_csv("data/nhl-player-data.csv")
df.head(5)
| Rk | Player | Nick | Age | Pos | Tm | GP | G | A | PTS | ... | TOI | ATOI | BLK | HIT | FOW | FOL | FO_percent | HART | Votes | Season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Connor McDavid | mcdavco01 | 20 | C | EDM | 82 | 30 | 70 | 100 | ... | 1733 | 21.133333 | 29.0 | 34 | 348.0 | 458.0 | 43.2 | 1 | 1604 | 2017 |
| 1 | 2 | Sidney Crosby | crosbsi01 | 29 | C | PIT | 75 | 44 | 45 | 89 | ... | 1491 | 19.883333 | 27.0 | 80 | 842.0 | 906.0 | 48.2 | 0 | 1104 | 2017 |
| 2 | 3 | Patrick Kane | kanepa01 | 28 | RW | CHI | 82 | 34 | 55 | 89 | ... | 1754 | 21.400000 | 15.0 | 28 | 7.0 | 44.0 | 13.7 | 0 | 206 | 2017 |
| 3 | 4 | Nicklas Backstrom | backsni02 | 29 | C | WSH | 82 | 23 | 63 | 86 | ... | 1497 | 18.266667 | 33.0 | 45 | 685.0 | 648.0 | 51.4 | 0 | 60 | 2017 |
| 4 | 5 | Nikita Kucherov | kucheni01 | 23 | RW | TBL | 74 | 40 | 45 | 85 | ... | 1438 | 19.433333 | 20.0 | 30 | 0.0 | 0.0 | 0.0 | 0 | 119 | 2017 |
5 rows × 32 columns
Types of the columns are displayed below:
df.dtypes
Rk int64 Player object Nick object Age int64 Pos object Tm object GP int64 G int64 A int64 PTS int64 plusminus int64 PIM int64 PS float64 EV int64 PP int64 SH int64 GW int64 EV.1 int64 PP.1 int64 SH.1 int64 S int64 S_percent float64 TOI int64 ATOI float64 BLK float64 HIT int64 FOW float64 FOL float64 FO_percent float64 HART int64 Votes int64 Season int64 dtype: object
As we mentioned before, we're mostly interested in predicting goals per season. The distribution of goals per season in our dataset looks like this:
g = sns.histplot(data=df, x="G", binwidth=5)
plt.xlabel("Goals per season")
plt.ylabel("Count of players")
plt.show()
Here are the descriptive statistics:
df["G"].describe()
count 12328.000000 mean 7.484263 std 8.846936 min 0.000000 25% 1.000000 50% 4.000000 75% 11.000000 max 65.000000 Name: G, dtype: float64
We can see, that the average player scores about 7.5 goals per season, the maximum is 65 goals per season, and over 20% of players didn't score any goal in a season.
The graph below shows distribution of players by season.
g = sns.histplot(data=df, x="Season", discrete=True)
We can see that the data are evenly distributed between seasons, however data for 2005 are missing because the season was cancelled.
Missing values¶
In total there are 438 missing cells in the dataset, which is about 0.1% of all cells. These are mostly concentrated in two columns:
- FO_percent (faceoff win percentage)
- S_percent (shooting percentage)
In my opinion, this is most likely caused by the fact that the given players didn't take any faceoffs during the season (e.g. defensemen typically don't take faceoffs), or respectively didn't shoot at the goal (maybe the given player played just 1 game in whole season). Both cases will result in division by zero.
There is also 1 missing value in each of the following columns: BLK, FOW and FOL.
Missing values aren't denoted by any special strings (such as "None" or "Null"), there are just 2 consecutive commas in the given row.
There is one row with player whose age is 0, which is definitely an error in the dataset.
Exploratory analysis¶
We're going to explore relations in the dataset.
Note: Players who scored 0 goals in a given season are excluded from all graphs.
Goals & player's age¶
Firstly, we're going to explore whether the player's age correlates with the goals scored in any way.
df = df.loc[df["G"] > 0]
g = sns.JointGrid(data=df, x="Age",y="G")
g1 = g.plot(sns.histplot, sns.histplot, discrete=True)
g2 = g1.set_axis_labels("Age", "Goals per season")
From the graph above, we see that suprisingly, there is no visible correlation between these 2 variables. The best results belong to players aged 20-25 results, however these players are also the most represented group in the dataset.
Goals & shooting percentage¶
In this part, we'll explore whether the goals scored depend on player's shooting percentage.
Here is a graph showing relation of these 2 variables.
df_scored = df.loc[df["S_percent"] > 0]
g = sns.JointGrid(data=df_scored, x="S_percent",y="G")
g1 = g.plot(sns.histplot, sns.histplot, binwidth=1)
g2 = g1.set_axis_labels("Shooting percentage", "Goals per season")
From the graph, it seems that the number of goals directly depends on shooting percentage. So, we can formulate the following hypothesis: The better shoting percentage, the more goals scored.
Of course, this doesn't apply to all data (e.g. there are several players with shooting percentage >= 50% and none of them scored more than 5 goals), however on average it seems to be true.
Goals & average time on ice¶
In this section, we're going to explore relation between goals scored and average time on ice.
Here is the graph showing relation between these 2 variables:
g = sns.JointGrid(data=df, x="ATOI",y="G")
g1 = g.plot(sns.histplot, sns.histplot, binwidth=1)
g2 = g1.set_axis_labels("Average time on ice", "Goals per season")
We can observe that the graph contains 2 clusters (one goes from average time on ice = 10 up right and second one goes from the same position to the right).
In my opinion, the first cluster contains forwards and the second one contains defencemen, who often spend a lot of time on ice, however don't score as much as forwards. There is also some strange data at the left showing players who spend on average < 5 minutes on ice, but score many goals (over 10 or even 20 goals per season).
The following graph shows the same relation but this time only for forwards:
df_forwards = df.loc[df["Pos"].isin(["C","LW","RW"])]
g = sns.JointGrid(data=df_forwards, x="ATOI",y="G")
g1 = g.plot(sns.histplot, sns.histplot, binwidth=1)
g2 = g1.set_axis_labels("Average time on ice", "Goals per season")
We can see that the first cluster indeed mostly consists of forwards.
The graph also suggests that for forwards, we can formulate the following hypothesis: Number of scored goals directly depends on average time on ice (the higher ice time, the more goals).
The same graph but for defencemen, looks like this:
df_forwards = df.loc[df["Pos"] == "D"]
g = sns.JointGrid(data=df_forwards, x="ATOI",y="G")
g1 = g.plot(sns.histplot, sns.histplot, binwidth=1)
g2 = g1.set_axis_labels("Average time on ice", "Goals per season")
We can observe that most of the strange data (players who spend on average < 5 minutes on ice, but score many goals) from the graph above belong to defencemen.
After looking at these suspicious data, we see that the average time on ice is miscalculated.
If we divide total time on ice (TOI) by games played (GP), we should get average time on ice (ATOI). However, for these data ATOI is way smaller (mostly in the range 0-5) than the real value (see ATOI_real column in the table below). Therefore, it's an error in the data (for further data processing, we may recalculate ATOI values).
df_sus = df.loc[(df["G"] > 10) & (df["ATOI"] <= 2) & (df["Pos"] == "D"),["Player","Season","Pos","GP","G","TOI","ATOI"]]
df_sus["ATOI_real"] = df_sus["TOI"] / df_sus["GP"]
df_sus.head(5)
| Player | Season | Pos | GP | G | TOI | ATOI | ATOI_real | |
|---|---|---|---|---|---|---|---|---|
| 8 | Brent Burns | 2017 | D | 82 | 29 | 2039 | 0.866667 | 24.865854 |
| 15 | Victor Hedman | 2017 | D | 79 | 16 | 1936 | 0.500000 | 24.506329 |
| 107 | Roman Josi | 2017 | D | 72 | 12 | 1805 | 1.066667 | 25.069444 |
| 110 | Alex Pietrangelo | 2017 | D | 80 | 14 | 2023 | 1.283333 | 25.287500 |
| 148 | Shea Weber | 2017 | D | 78 | 17 | 1955 | 1.066667 | 25.064103 |
After exluding these error data, we see that for defencemen, there is a similar hypothesis: Higher average time on ice typically means more scored goals.